Missing data and multiple imputation

MACS 30200
University of Chicago

May 10, 2017

Causes of missingness

  • Surveys
  • Errors in data collection
  • Intentional
  • Censored values

Patterns of missingness

  • Missing completely at random (MCAR)
  • Missing at random (MAR)
  • Missing not at random (MNAR)
  • Why do we care?
    • Mechanism
    • Ignorable vs. non-ignorable
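
A small simulation can make the three mechanisms concrete. The sketch below uses made-up data: the variables `x` and `y`, the 30% missingness rate, and the logistic missingness rules are all assumptions chosen for illustration. It compares what complete cases recover under each mechanism.

```r
set.seed(1234)
n <- 10000
x <- rnorm(n)            # fully observed covariate
y <- 2 + x + rnorm(n)    # variable subject to missingness; true mean is 2

# MCAR: every value has the same 30% chance of being missing
miss_mcar <- runif(n) < 0.3
# MAR: missingness depends only on the observed covariate x
miss_mar <- runif(n) < plogis(x)
# MNAR: missingness depends on the (unobserved) value of y itself
miss_mnar <- runif(n) < plogis(y - 2)

# Mean of y among the observed cases under each mechanism
mean_mcar <- mean(y[!miss_mcar])
mean_mar  <- mean(y[!miss_mar])
mean_mnar <- mean(y[!miss_mnar])

# Under this MAR rule, the complete-case regression of y on x
# still recovers the slope, because missingness depends only on x
slope_mar <- coef(lm(y ~ x, subset = !miss_mar))[["x"]]
```

In this setup the observed mean of `y` stays close to 2 only under MCAR, while the regression slope survives under MAR. This is one way to see why the mechanism determines whether missingness is ignorable.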

Things to consider

  1. Does the method provide consistent estimates of the population parameters?
  2. Does the method provide valid statistical inferences?
  3. Does the method use the observed data efficiently or does it recklessly discard information?

Complete-case analysis

  • Listwise deletion
  • Advantages
  • Disadvantages
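
As a minimal sketch of listwise deletion, the toy data frame below (hypothetical values, not the data used later in these slides) has only one or two missing values per column, yet half the rows are lost once any row with a missing value is dropped.

```r
# Toy data: each column is missing only one or two values
df <- data.frame(
  y  = c(1.2, 2.4, NA, 4.1, 5.0, 6.3, 7.1, 8.4),
  x1 = c(1, 2, 3, NA, 5, 6, 7, 8),
  x2 = c(0.3, NA, 1.1, 2.0, 1.7, 3.2, 2.9, NA)
)

# Keep only rows with no missing values on any variable
complete <- df[complete.cases(df), ]
n_total <- nrow(df)
n_complete <- nrow(complete)

# With the default na.action (na.omit), lm() performs the same deletion
fit <- lm(y ~ x1 + x2, data = df)
n_used <- nobs(fit)
```

The method is simple and is what most modeling functions do by default. The cost is the discarded information, which grows quickly as more variables enter the model.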

Available-case analysis

  • Pairwise deletion
  • Advantages
  • Disadvantages
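
Pairwise deletion can be sketched with `cor()`, which accepts `use = "pairwise.complete.obs"`. The toy data frame is hypothetical. The point is that each correlation is computed from a different subset of rows.

```r
# Toy data with scattered missing values
df <- data.frame(
  y  = c(1.2, 2.4, NA, 4.1, 5.0, 6.3, 7.1, 8.4),
  x1 = c(1, 2, 3, NA, 5, 6, 7, 8),
  x2 = c(0.3, NA, 1.1, 2.0, 1.7, 3.2, 2.9, NA)
)

# Listwise: every correlation uses the same fully observed rows
cor_listwise <- cor(df, use = "complete.obs")
# Pairwise: each correlation uses all rows observed on that pair
cor_pairwise <- cor(df, use = "pairwise.complete.obs")

# Number of rows contributing to each pair of variables
observed <- (!is.na(df)) * 1      # 1 if observed, 0 if missing
pair_n <- crossprod(observed)     # diagonal holds per-variable counts
```

Pairwise deletion uses more of the data than listwise deletion. Because the entries of the matrix rest on different samples, the resulting correlation matrix is not guaranteed to be positive semidefinite, and there is no single sample size to use for inference.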

Imputation

  • Imputation
  • Unconditional mean imputation
  • Conditional mean imputation
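
The two single-imputation strategies can be compared on simulated data. This is a sketch under assumed values: the sample size, coefficients, and the 150 values deleted completely at random are all illustrative.

```r
set.seed(1234)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Delete 150 values of y completely at random
y_obs <- y
y_obs[sample(n, 150)] <- NA

# Unconditional mean imputation: fill with the observed mean of y
y_uncond <- ifelse(is.na(y_obs), mean(y_obs, na.rm = TRUE), y_obs)

# Conditional mean imputation: fill with regression predictions from x
fit <- lm(y_obs ~ x)
y_cond <- ifelse(is.na(y_obs), predict(fit, newdata = data.frame(x = x)), y_obs)

# Compare variances and correlations with the full (pre-deletion) data
var_full <- var(y)
var_uncond <- var(y_uncond)
var_cond <- var(y_cond)
cor_full <- cor(x, y)
cor_uncond <- cor(x, y_uncond)
cor_cond <- cor(x, y_cond)
```

In this setup both methods understate the variance of `y`. Unconditional mean imputation weakens the correlation with `x`, while conditional mean imputation strengthens it, because the imputed points sit exactly on the regression line. Neither reflects the uncertainty about the imputed values.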

Maximum-likelihood estimation

\[p(\mathbf{X}; \theta) = p(\mathbf{X}_{\text{obs}}, \mathbf{X}_{\text{mis}}; \theta)\]

  • Data MAR

    \[p(\mathbf{X}_\text{obs}; \theta) = \int{p(\mathbf{X}_{\text{obs}}, \mathbf{X}_{\text{mis}}; \theta)} d\mathbf{X}_{\text{mis}}\]

  • Closed-form solution
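
As a small worked case, assume the rows of \(\mathbf{X}\) are independent draws from a univariate normal distribution (an assumption made here for illustration, not one stated on the slide). The integral over the missing values then factorises, and the observed-data likelihood involves only the observed rows:

\[p(\mathbf{X}_\text{obs}; \mu, \sigma^2) = \prod_{i \in \text{obs}} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2 \sigma^2}\right)\]

\[\hat{\mu} = \frac{1}{n_\text{obs}} \sum_{i \in \text{obs}} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n_\text{obs}} \sum_{i \in \text{obs}} (x_i - \hat{\mu})^2\]

In this special case the maximum-likelihood estimates are available in closed form. General patterns of multivariate missingness typically require an iterative method instead.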

Expectation-maximization (EM) algorithm

  1. Find the expectation of the complete-data log-likelihood

    \[E[\log_e L(\theta; \mathbf{X}) | \theta^{(l)}] = \int{\log_e L(\theta; \mathbf{X}) \, p(\mathbf{X}_\text{mis} | \mathbf{X}_\text{obs}, \theta^{(l)}) \, d \mathbf{X}_\text{mis}}\]
  2. Find the values \(\theta^{(l+1)}\) of \(\theta\) that maximize the expected log-likelihood \(E[\log_e L(\theta; \mathbf{X}) | \theta^{(l)}]\)
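
The two steps can be sketched for a bivariate normal model in which `x` is fully observed and `y` is partly missing. The data, the missingness rule, and the tolerance are assumptions chosen for illustration. For the normal model, taking the expectation of the complete-data log-likelihood reduces to replacing the missing `y` and `y^2` terms with their conditional expectations given `x` and the current parameters.

```r
set.seed(1234)
n <- 2000
x <- rnorm(n)
y <- 1 + 0.8 * x + rnorm(n, sd = 0.6)   # true mean of y is 1
miss <- runif(n) < plogis(1.5 * x)      # MAR: depends only on observed x
y_obs <- ifelse(miss, NA, y)

# Starting values from the complete cases
mu <- c(mean(x), mean(y_obs, na.rm = TRUE))
sigma <- cov(cbind(x, y_obs), use = "complete.obs")

for (iter in 1:200) {
  # E-step: expected y and y^2 for missing cases, given x and current theta
  beta <- sigma[1, 2] / sigma[1, 1]
  resid_var <- sigma[2, 2] - sigma[1, 2]^2 / sigma[1, 1]
  ey <- ifelse(miss, mu[2] + beta * (x - mu[1]), y_obs)
  ey2 <- ifelse(miss, ey^2 + resid_var, y_obs^2)

  # M-step: maximize the expected log-likelihood (update means, covariances)
  mu_new <- c(mean(x), mean(ey))
  sxx <- mean(x^2) - mu_new[1]^2
  sxy <- mean(x * ey) - mu_new[1] * mu_new[2]
  syy <- mean(ey2) - mu_new[2]^2
  sigma_new <- matrix(c(sxx, sxy, sxy, syy), 2, 2)

  # Stop when the parameters no longer change
  converged <- max(abs(c(mu_new - mu, sigma_new - sigma))) < 1e-8
  mu <- mu_new
  sigma <- sigma_new
  if (converged) break
}

mean_cc <- mean(y_obs, na.rm = TRUE)   # complete-case estimate
mean_em <- mu[2]                       # EM (maximum-likelihood) estimate
```

In this simulation the complete-case mean of `y` is pulled well below 1 because high values of `x` are more likely to have `y` missing, while the EM estimate uses the observed `x` values to correct for that.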

Bayesian multiple imputation

  • Multivariate normal distribution
  • Multiple imputation
  • Account for uncertainty within and across the datasets
  • Bayesian method

Conducting inference

\[\tilde{\beta}_j \equiv \frac{\sum_{l=1}^g B_j^{(l)}}{g}\]

Conducting inference

\[\tilde{\text{SE}}(\tilde{\beta}_j) = \sqrt{V_j^{(W)} + \frac{g + 1}{g} V_j^{(B)}}\]

\[V_j^{(W)} = \frac{\sum_{l=1}^g \text{SE}^2(B_j^{(l)})}{g}\]

\[V_j^{(B)} = \frac{\sum_{l=1}^g (B_j^{(l)} - \tilde{\beta}_j)^2}{g-1}\]

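Here \(V_j^{(W)}\) is the within-imputation variance, \(V_j^{(B)}\) is the between-imputation variance, and \(\text{SE}^2(B_j^{(l)})\) is the estimated sampling variance of \(B_j\) in imputed dataset \(l\).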

Practical considerations for multiple imputation

  • Which variables to include
  • Transform variables to approximately normal
  • Adjust the imputed data to resemble the original data
  • Make sure the imputation model captures relevant features of the data
  • \(g\) doesn’t need to be large
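
The advice to transform variables can be sketched in base R. The variables below are hypothetical stand-ins: a right-skewed positive variable and a percentage bounded between 0 and 100. Amelia's `amelia()` function has `logs`, `sqrts`, and `lgstc` arguments for this purpose, but the sketch does not depend on them.

```r
set.seed(1234)

# Hypothetical right-skewed, strictly positive variable
income <- rlnorm(1000, meanlog = 8, sdlog = 1)

# Simple moment-based skewness, to compare the raw and logged variable
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3
skew_raw <- skewness(income)
skew_log <- skewness(log(income))

# Hypothetical percentage bounded between 0 and 100
pct <- c(5, 20, 52, 71.9, 95)
z <- qlogis(pct / 100)            # logit transform to the real line
pct_back <- 100 * plogis(z)       # back-transform recovers the original values

# Any real-valued draw on the logit scale maps back inside (0, 100)
z_draws <- c(-6, -2, 0, 3, 8)
pct_draws <- 100 * plogis(z_draws)
```

Imputing on the transformed scale and back-transforming keeps imputed values inside the variable's natural range, which a normal model fitted to the raw variable does not guarantee.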

Infant mortality

Regression model

##                term estimate std.error statistic  p.value
## 1       (Intercept)   6.8840   0.29039     23.71 1.58e-31
## 2 log(GDPperCapita)  -0.2943   0.05765     -5.10 3.85e-06
## 3     contraception  -0.0113   0.00424     -2.66 1.01e-02
## 4   educationFemale  -0.0770   0.03378     -2.28 2.63e-02

Missingness

Number of missing values per variable:

infantMortality    GDPperCapita   contraception educationFemale
              6              10              63             131
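
Counts like these can be obtained with `colSums(is.na(...))`. The sketch below uses a small hypothetical data frame rather than the UN data, and also flags rows that are missing on every variable.

```r
# Toy data frame (hypothetical values); the last row is entirely missing
df <- data.frame(
  infantMortality = c(154, 32, NA, 11, NA),
  GDPperCapita    = c(2848, 863, 1531, NA, NA),
  contraception   = c(NA, NA, 52, NA, NA),
  educationFemale = c(NA, NA, 9.9, NA, NA)
)

# Number of missing values in each column
n_missing <- colSums(is.na(df))

# Rows with no observed values at all
fully_missing <- unname(which(rowSums(is.na(df)) == ncol(df)))
```

Rows with no observed values carry no information for an imputation model, so checking for them before imputing is a useful first step.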

Amelia

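library(Amelia)
# Create m = 5 imputed datasets; idvars are kept in the output but excluded from the imputation model
un.out <- amelia(as.data.frame(un), m = 5, idvars = c("country", "region"))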
## Warning: There are observations in the data that are completely missing. 
##          These observations will remain unimputed in the final datasets. 
## -- Imputation 1 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
## 
## -- Imputation 2 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
##  61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
##  81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98
## 
## -- Imputation 3 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
## 
## -- Imputation 4 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
##  61 62 63 64 65 66
## 
## -- Imputation 5 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
##  61 62 63 64 65 66 67 68 69 70
## List of 5
##  $ imp1:'data.frame':    207 obs. of  14 variables:
##   ..$ country               : chr [1:207] "Afghanistan" "Albania" "Algeria" "American.Samoa" ...
##   ..$ region                : chr [1:207] "Asia" "Europe" "Africa" "Asia" ...
##   ..$ tfr                   : num [1:207] 6.9 2.6 3.81 1.86 NA ...
##   ..$ contraception         : num [1:207] -12 71.9 52 43.3 NA ...
##   ..$ educationMale         : num [1:207] 4.05 11.02 11.1 13 NA ...
##   ..$ educationFemale       : num [1:207] 0.319 10.646 9.9 12.313 NA ...
##   ..$ lifeMale              : num [1:207] 45 68 67.5 68 NA ...
##   ..$ lifeFemale            : num [1:207] 46 74 70.3 73 NA ...
##   ..$ infantMortality       : num [1:207] 154 32 44 11 NA 124 24 22 25 6 ...
##   ..$ GDPperCapita          : num [1:207] 2848 863 1531 3207 NA ...
##   ..$ economicActivityMale  : num [1:207] 87.5 78.3 76.4 58.8 NA ...
##   ..$ economicActivityFemale: num [1:207] 7.2 68.9 7.8 42.4 NA ...
##   ..$ illiteracyMale        : num [1:207] 52.8 8.941 26.1 0.264 NA ...
##   ..$ illiteracyFemale      : num [1:207] 85 16.02 51 0.36 NA ...
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ imp2:'data.frame':    207 obs. of  14 variables:
##   ..$ country               : chr [1:207] "Afghanistan" "Albania" "Algeria" "American.Samoa" ...
##   ..$ region                : chr [1:207] "Asia" "Europe" "Africa" "Asia" ...
##   ..$ tfr                   : num [1:207] 6.9 2.6 3.81 3.39 NA ...
##   ..$ contraception         : num [1:207] 17.3 50.8 52 19.8 NA ...
##   ..$ educationMale         : num [1:207] 7.2 7.72 11.1 12.77 NA ...
##   ..$ educationFemale       : num [1:207] 2.11 9.37 9.9 12.76 NA ...
##   ..$ lifeMale              : num [1:207] 45 68 67.5 68 NA ...
##   ..$ lifeFemale            : num [1:207] 46 74 70.3 73 NA ...
##   ..$ infantMortality       : num [1:207] 154 32 44 11 NA 124 24 22 25 6 ...
##   ..$ GDPperCapita          : num [1:207] 2848 863 1531 8316 NA ...
##   ..$ economicActivityMale  : num [1:207] 87.5 90.1 76.4 58.8 NA ...
##   ..$ economicActivityFemale: num [1:207] 7.2 60.4 7.8 42.4 NA ...
##   ..$ illiteracyMale        : num [1:207] 52.8 2.955 26.1 0.264 NA ...
##   ..$ illiteracyFemale      : num [1:207] 85 3.52 51 0.36 NA ...
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ imp3:'data.frame':    207 obs. of  14 variables:
##   ..$ country               : chr [1:207] "Afghanistan" "Albania" "Algeria" "American.Samoa" ...
##   ..$ region                : chr [1:207] "Asia" "Europe" "Africa" "Asia" ...
##   ..$ tfr                   : num [1:207] 6.9 2.6 3.81 3.35 NA ...
##   ..$ contraception         : num [1:207] -7.24 56.17 52 67.65 NA ...
##   ..$ educationMale         : num [1:207] 6.16 9.37 11.1 14 NA ...
##   ..$ educationFemale       : num [1:207] 3.58 10.19 9.9 13.66 NA ...
##   ..$ lifeMale              : num [1:207] 45 68 67.5 68 NA ...
##   ..$ lifeFemale            : num [1:207] 46 74 70.3 73 NA ...
##   ..$ infantMortality       : num [1:207] 154 32 44 11 NA 124 24 22 25 6 ...
##   ..$ GDPperCapita          : num [1:207] 2848 863 1531 3568 NA ...
##   ..$ economicActivityMale  : num [1:207] 87.5 78.9 76.4 58.8 NA ...
##   ..$ economicActivityFemale: num [1:207] 7.2 63 7.8 42.4 NA ...
##   ..$ illiteracyMale        : num [1:207] 52.8 1.728 26.1 0.264 NA ...
##   ..$ illiteracyFemale      : num [1:207] 85 14.92 51 0.36 NA ...
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ imp4:'data.frame':    207 obs. of  14 variables:
##   ..$ country               : chr [1:207] "Afghanistan" "Albania" "Algeria" "American.Samoa" ...
##   ..$ region                : chr [1:207] "Asia" "Europe" "Africa" "Asia" ...
##   ..$ tfr                   : num [1:207] 6.9 2.6 3.81 1.66 NA ...
##   ..$ contraception         : num [1:207] 19.9 35.3 52 66.9 NA ...
##   ..$ educationMale         : num [1:207] 6.52 5.18 11.1 13.19 NA ...
##   ..$ educationFemale       : num [1:207] 2.28 5.72 9.9 13.49 NA ...
##   ..$ lifeMale              : num [1:207] 45 68 67.5 68 NA ...
##   ..$ lifeFemale            : num [1:207] 46 74 70.3 73 NA ...
##   ..$ infantMortality       : num [1:207] 154 32 44 11 NA 124 24 22 25 6 ...
##   ..$ GDPperCapita          : num [1:207] 2848 863 1531 4049 NA ...
##   ..$ economicActivityMale  : num [1:207] 87.5 89.5 76.4 58.8 NA ...
##   ..$ economicActivityFemale: num [1:207] 7.2 34 7.8 42.4 NA ...
##   ..$ illiteracyMale        : num [1:207] 52.8 16.589 26.1 0.264 NA ...
##   ..$ illiteracyFemale      : num [1:207] 85 19.43 51 0.36 NA ...
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  $ imp5:'data.frame':    207 obs. of  14 variables:
##   ..$ country               : chr [1:207] "Afghanistan" "Albania" "Algeria" "American.Samoa" ...
##   ..$ region                : chr [1:207] "Asia" "Europe" "Africa" "Asia" ...
##   ..$ tfr                   : num [1:207] 6.9 2.6 3.81 2.18 NA ...
##   ..$ contraception         : num [1:207] 20.6 32.5 52 71.5 NA ...
##   ..$ educationMale         : num [1:207] 7.37 10.27 11.1 15.1 NA ...
##   ..$ educationFemale       : num [1:207] 4.22 10.08 9.9 16.11 NA ...
##   ..$ lifeMale              : num [1:207] 45 68 67.5 68 NA ...
##   ..$ lifeFemale            : num [1:207] 46 74 70.3 73 NA ...
##   ..$ infantMortality       : num [1:207] 154 32 44 11 NA 124 24 22 25 6 ...
##   ..$ GDPperCapita          : num [1:207] 2848 863 1531 15980 NA ...
##   ..$ economicActivityMale  : num [1:207] 87.5 73.7 76.4 58.8 NA ...
##   ..$ economicActivityFemale: num [1:207] 7.2 33.9 7.8 42.4 NA ...
##   ..$ illiteracyMale        : num [1:207] 52.8 15.825 26.1 0.264 NA ...
##   ..$ illiteracyFemale      : num [1:207] 85 24.52 51 0.36 NA ...
##   ..- attr(*, "spec")=List of 2
##   .. ..- attr(*, "class")= chr "col_spec"
##  - attr(*, "class")= chr [1:2] "mi" "list"

MI scatterplot

[Figure: one scatterplot for each imputed dataset, imp1 through imp5]

Conducting inference

## # A tibble: 20 × 6
##       id              term estimate std.error statistic   p.value
##    <chr>             <chr>    <dbl>     <dbl>     <dbl>     <dbl>
##  1  imp1       (Intercept)  6.47310   0.16284     39.75  1.30e-96
##  2  imp1 log(GDPperCapita) -0.20206   0.02988     -6.76  1.47e-10
##  3  imp1     contraception -0.00480   0.00241     -2.00  4.72e-02
##  4  imp1   educationFemale -0.14254   0.01799     -7.92  1.60e-13
##  5  imp2       (Intercept)  6.44744   0.14350     44.93 8.28e-106
##  6  imp2 log(GDPperCapita) -0.20265   0.02722     -7.45  2.91e-12
##  7  imp2     contraception -0.00596   0.00206     -2.90  4.18e-03
##  8  imp2   educationFemale -0.13358   0.01461     -9.14  7.46e-17
##  9  imp3       (Intercept)  6.57374   0.15260     43.08 3.79e-103
## 10  imp3 log(GDPperCapita) -0.20811   0.02774     -7.50  2.00e-12
## 11  imp3     contraception -0.00507   0.00224     -2.26  2.48e-02
## 12  imp3   educationFemale -0.14579   0.01707     -8.54  3.42e-15
## 13  imp4       (Intercept)  6.49875   0.17864     36.38  1.29e-89
## 14  imp4 log(GDPperCapita) -0.21912   0.03250     -6.74  1.68e-10
## 15  imp4     contraception -0.00710   0.00228     -3.11  2.14e-03
## 16  imp4   educationFemale -0.11895   0.01596     -7.45  2.75e-12
## 17  imp5       (Intercept)  6.52708   0.16142     40.44  1.24e-97
## 18  imp5 log(GDPperCapita) -0.21800   0.03058     -7.13  1.84e-11
## 19  imp5     contraception -0.00650   0.00218     -2.98  3.27e-03
## 20  imp5   educationFemale -0.12604   0.01722     -7.32  6.09e-12

Conducting inference

##                term estimate std.error estimate.mi std.error.mi
## 1       (Intercept)   6.8840   0.29039     6.50402      0.16896
## 2 log(GDPperCapita)  -0.2943   0.05765    -0.20999      0.03097
## 3     contraception  -0.0113   0.00424    -0.00588      0.00247
## 4   educationFemale  -0.0770   0.03378    -0.13338      0.02064
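
The pooled intercept in the table above can be reproduced from the five per-imputation intercept rows using the combining formulas for \(\tilde{\beta}_j\) and \(\tilde{\text{SE}}(\tilde{\beta}_j)\). The helper name `pool_mi` is hypothetical. The input numbers are copied from the per-imputation results.

```r
# Intercept estimates and standard errors from imp1 to imp5
est <- c(6.47310, 6.44744, 6.57374, 6.49875, 6.52708)
se  <- c(0.16284, 0.14350, 0.15260, 0.17864, 0.16142)

# Combine g sets of estimates and standard errors
pool_mi <- function(est, se) {
  g <- length(est)
  beta_tilde <- mean(est)                             # pooled estimate
  v_within <- mean(se^2)                              # V^(W)
  v_between <- sum((est - beta_tilde)^2) / (g - 1)    # V^(B)
  se_tilde <- sqrt(v_within + (g + 1) / g * v_between)
  c(estimate = beta_tilde, std.error = se_tilde)
}

pooled <- pool_mi(est, se)
round(pooled, 5)
```

Rounded to five decimals, this gives 6.50402 and 0.16896, the `estimate.mi` and `std.error.mi` values reported for the intercept. The same function applied to the other three terms should reproduce the remaining rows.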

Missingness map

Transforming variables

## Warning: There are observations in the data that are completely missing. 
##          These observations will remain unimputed in the final datasets. 
## -- Imputation 1 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32
## 
## -- Imputation 2 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44
## 
## -- Imputation 3 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
##  61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
##  81 82 83 84 85 86 87 88 89 90 91
## 
## -- Imputation 4 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42
## 
## -- Imputation 5 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54

New model results

## # A tibble: 20 × 6
##       id              term estimate std.error statistic   p.value
##    <chr>             <chr>    <dbl>     <dbl>     <dbl>     <dbl>
##  1  imp1       (Intercept)   6.4507   0.16693     38.64  5.43e-95
##  2  imp1 log(GDPperCapita)  -0.2271   0.03471     -6.54  4.88e-10
##  3  imp1     contraception  -0.0101   0.00239     -4.23  3.56e-05
##  4  imp1   educationFemale  -0.0955   0.02010     -4.75  3.82e-06
##  5  imp2       (Intercept)   6.5306   0.17201     37.97  1.24e-93
##  6  imp2 log(GDPperCapita)  -0.3042   0.03822     -7.96  1.26e-13
##  7  imp2     contraception  -0.0146   0.00248     -5.89  1.56e-08
##  8  imp2   educationFemale  -0.0285   0.02346     -1.21  2.26e-01
##  9  imp3       (Intercept)   6.4457   0.17031     37.85  2.15e-93
## 10  imp3 log(GDPperCapita)  -0.2437   0.03737     -6.52  5.59e-10
## 11  imp3     contraception  -0.0153   0.00225     -6.83  9.83e-11
## 12  imp3   educationFemale  -0.0606   0.02074     -2.92  3.89e-03
## 13  imp4       (Intercept)   6.1839   0.16821     36.76  3.57e-91
## 14  imp4 log(GDPperCapita)  -0.1512   0.03652     -4.14  5.11e-05
## 15  imp4     contraception  -0.0109   0.00221     -4.91  1.90e-06
## 16  imp4   educationFemale  -0.1260   0.01892     -6.66  2.56e-10
## 17  imp5       (Intercept)   6.4678   0.15379     42.06 1.44e-101
## 18  imp5 log(GDPperCapita)  -0.2257   0.02989     -7.55  1.48e-12
## 19  imp5     contraception  -0.0137   0.00206     -6.65  2.73e-10
## 20  imp5   educationFemale  -0.0804   0.01562     -5.15  6.23e-07
##                term estimate std.error estimate.mi std.error.mi
## 1       (Intercept)   6.8840   0.29039      6.4158      0.22183
## 2 log(GDPperCapita)  -0.2943   0.05765     -0.2304      0.06954
## 3     contraception  -0.0113   0.00424     -0.0129      0.00341
## 4   educationFemale  -0.0770   0.03378     -0.0782      0.04483